DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents

Authors

  • Zhao Yan
  • Nan Duan
  • Jun-Wei Bao
  • Peng Chen
  • Ming Zhou
  • Zhoujun Li
  • Jianshe Zhou
Abstract

Most current chatbot engines are designed to reply to user utterances based on existing utterance-response (or Q-R) pairs. In this paper, we present DocChat, a novel information retrieval approach for chatbot engines that can leverage unstructured documents, instead of Q-R pairs, to respond to utterances. A learning-to-rank model with features designed at different levels of granularity is proposed to measure the relevance between utterances and responses directly. We evaluate our proposed approach in both English and Chinese: (i) For English, we evaluate DocChat on WikiQA and QASent, two answer sentence selection tasks, and compare it with state-of-the-art methods. Reasonable improvements and good adaptability are observed. (ii) For Chinese, we compare DocChat with XiaoIce, a famous chitchat engine in China, and side-by-side evaluation shows that DocChat is a perfect complement for chatbot engines using Q-R pairs as the main source of responses.
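
The general idea of picking a response sentence from an unstructured document can be illustrated with a minimal sketch. This is not the DocChat model: the abstract does not list its features, so the two overlap features below (word-level and bigram-level) and the hand-set weights are assumptions chosen only to show scoring at more than one level of granularity. In a learning-to-rank setting the weights would be learned from labeled data. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(query_tokens, sent_tokens):
    """Word-level feature: share of distinct query words found in the sentence."""
    query_words = set(query_tokens)
    if not query_words:
        return 0.0
    return len(query_words & set(sent_tokens)) / len(query_words)


def bigram_overlap(query_tokens, sent_tokens):
    """Phrase-level feature: share of distinct query bigrams found in the sentence."""
    query_bigrams = set(zip(query_tokens, query_tokens[1:]))
    if not query_bigrams:
        return 0.0
    sent_bigrams = set(zip(sent_tokens, sent_tokens[1:]))
    return len(query_bigrams & sent_bigrams) / len(query_bigrams)


def rank_sentences(utterance, document, weights=(0.6, 0.4)):
    """Score each sentence of the document against the utterance.

    The weights are hand-set assumptions for illustration, not learned values.
    Returns (score, sentence) pairs sorted from most to least relevant.
    """
    # Naive sentence splitting on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_tokens = tokenize(utterance)
    scored = []
    for sentence in sentences:
        sent_tokens = tokenize(sentence)
        score = (weights[0] * word_overlap(query_tokens, sent_tokens)
                 + weights[1] * bigram_overlap(query_tokens, sent_tokens))
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Calling `rank_sentences(utterance, document)` returns every sentence with its score, so the top entry can serve as the candidate response and a score threshold can decide whether to respond at all.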

Similar resources

Map Reduce Text Clustering Using Vector Space Model

Information retrieval is the area of finding particular web pages via a query to an internet search engine. Although sophisticated algorithms and data structures are used in traditional computing techniques to create indexes for efficiently organizing and retrieving information, data mining techniques such as clustering are currently used to improve the efficiency of the retrieval process. ...

Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

Purpose: Many metrics have traditionally been used to evaluate search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search engine evaluation. The purpose of this study was therefore to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...

Indexation spatiale et temporelle basée sur un principe de "tuilage" : contribution à la recherche d'information géographique dans des documents textuels faiblement structurés

Most search engines process users’ information needs by retrieving documents from pre-built term-based indexes. Such approaches are limited in particular contexts or for specific retrieval criteria. Our contribution concerns geographical information retrieval (GIR) and proposes to exploit both spatial and temporal facets to extend classical thematic engines in order to parse unstructured ...

Retrieval of Legal Documents: Combining Structured and Unstructured Information

Legal information is often accessible via portal web sites. Legal documents typically combine structured and unstructured information, the former being tagged with markup languages such as XML (Extensible Markup Language). Current information retrieval research takes into account the structured information content of documents when computing the relevance ranking. Such an approach is very promi...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...

Publication date: 2016